Outlier Detection Using Clustering Methods: a data cleaning application

نویسندگان

  • Antonio Loureiro
  • Luis Torgo
  • Carlos Soares
چکیده

This paper describes a methodology for the application of hierarchical clustering methods to the task of outlier detection. The methodology is tested on the problem of cleaning Official Statistics data. The goal is to detect erroneous foreign trade transactions in data collected by the Portuguese Institute of Statistics (INE). These transactions are a minority, but still they have an important impact on the statistics produced by the institute. The task of detecting these rare errors is a manual, time-consuming task. Our methodology is able to save a large amount of time by selecting a small subset of suspicious transactions for manual inspection, which, nevertheless, includes most of the erroneous transactions. In this study we compare several alternative hierarchical clustering methodologies for this task. The results we have obtained confirm the validity of the use of hierarchical clustering techniques for this task. Moreover, our results when compared to previous approaches to the same data, clearly outperform them, identifying the same level of erroneous transactions with significantly less manual inspection.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Resource-bounded Outlier Detection using Clustering Methods

This paper describes a methodology for the application of hierarchical clustering methods to the task of outlier detection. The methodology is tested on the problem of cleaning Official Statistics data. The goal is to detect erroneous foreign trade transactions in data collected by the Portuguese Institute of Statistics (INE). These transactions are a minority, but still they have an important ...

متن کامل

Outlier Detection Using Extreme Learning Machines Based on Quantum Fuzzy C-Means

One of the most important concerns of a data miner is always to have accurate and error-free data. Data that does not contain human errors and whose records are full and contain correct data. In this paper, a new learning model based on an extreme learning machine neural network is proposed for outlier detection. The function of neural networks depends on various parameters such as the structur...

متن کامل

A Meta analysis study of outlier detection methods in classification

An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism (Hawkins, 1980). Outlier detection has many applications, such as data cleaning, Fraud detection and network intrusion. The existence of outliers can indicate individuals or groups that have behavior very different to the most of the individuals of the...

متن کامل

A Spectral Clustering Based Outlier Detection Technique

Outlier detection shows its increasingly high practical value in many application areas such as intrusion detection, fraud detection, discovery of criminal activities in electronic commerce and so on. Many techniques have been developed for outlier detection, including distribution-based outlier detection algorithm, depth-based outlier detection algorithm, distance-based outlier detection algor...

متن کامل

Support Vector Clustering for Outlier Detection

In this paper a novel Support vector clustering(SVC) method for outlier detection is proposed. Outlier detection algorithms have application in several tasks such as data mining, data preprocessing, data filter-cleaner, time series analysis and so on. Traditionally outlier detection methods are mostly based on modeling data based on its statistical properties and these approaches are only prefe...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004